
This replication file uses a flat file structure for simplicity, so all files mentioned here should be in the same directory.

#######################
## Replication Steps ##
#######################

First install dependencies:
The Python packages you will need are pandas, matplotlib, numpy, scikit-learn (imported as sklearn), and textdistance. You can install any of these by running pip install <package name> in your Python environment. The scripts also use pickle, itertools, and random, which are part of the Python standard library and need no installation. You will also need functions.py in the same directory.
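
To confirm the third-party packages are available before starting the long first step, a check along these lines may help. It is a minimal sketch rather than part of the replication code: the function name missing_packages and the THIRD_PARTY mapping are hypothetical, and the check only tests whether each package can be found, not whether its version matches the Environment section.

```python
import importlib.util

# Third-party packages named in this README, keyed by import name.
# Values are the names to pass to pip (scikit-learn is imported as sklearn).
THIRD_PARTY = {
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "numpy": "numpy",
    "sklearn": "scikit-learn",
    "textdistance": "textdistance",
}


def missing_packages(packages=THIRD_PARTY):
    """Return the pip names of packages that cannot be found in this environment."""
    return [
        pip_name
        for import_name, pip_name in packages.items()
        if importlib.util.find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; install with: pip install " + " ".join(missing))
    else:
        print("All third-party dependencies can be imported.")
```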

1) Run Amicus_Bonica_Replication.py. This script performs the Interest Group Ideology experiment and requires hand_labeled_matches.csv, unrepresented.p, amicus_org_names.csv, bonica_org_names.csv, and all csvs of the form iter#_pairs.csv. Because computing string distance metrics for the ~30 million string pairs in each iteration takes several hours, the script takes about a day and a half to run. At the point in each iteration where a human user would label 100 string pairs, the script instead loads the pre-labeled csvs. Amicus_Bonica_Replication.py saves 10 random forest models as pickle files, one per iteration, each named iter<iteration#>_model.p. In addition, the script saves full_train_set.csv, the processed labeled training set of 71,043 string pairs, and iter0_model.p, the model trained on that dataset; both are also used in the other two tasks. The script also saves feature_importances_by_iteration.csv, AUC_table.csv, and Final_Results.csv. At this stage, feature_importances_by_iteration.csv includes only the iterations from the Interest Group Ideology task. Data from the City Name Correction task is added when you run Cities_Replication.py.
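
If you want to inspect the saved models after this step, they can be read back with the standard pickle module. The sketch below assumes the files were written with pickle.dump; the helper name load_iteration_models is hypothetical and is not defined in functions.py. Unpickling the random forests also requires scikit-learn to be installed, ideally the version listed under Environment, and pickle files should only be loaded from a source you trust.

```python
import pickle
import re
from pathlib import Path

# Matches iter0_model.p, iter1_model.p, ..., iter10_model.p but not cities_model.p.
MODEL_PATTERN = re.compile(r"^iter(\d+)_model\.p$")


def load_iteration_models(directory="."):
    """Return {iteration number: unpickled object} for each iter<N>_model.p found."""
    models = {}
    for path in Path(directory).iterdir():
        match = MODEL_PATTERN.match(path.name)
        if match:
            with open(path, "rb") as handle:
                models[int(match.group(1))] = pickle.load(handle)
    # Sort by iteration number so that iter10 comes after iter9, not after iter1.
    return dict(sorted(models.items()))
```

Keying the result by the integer iteration number avoids the alphabetical ordering problem that arises when sorting the file names directly.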

2) Run Cities_Replication.py (approximately 10 minutes), which performs the City Name Correction task. This script requires 150k_plus_cities.csv, uscities.csv, cities_hitl_pairs.csv, full_train_set.csv, and iter0_model.p, the last two of which are outputs of Amicus_Bonica_Replication.py. As in Amicus_Bonica_Replication.py, when it comes time for the human to label 500 string pairs in the human-in-the-loop step, the script automatically loads the pre-labeled pairs. This script saves cities_all_pairs.csv, cities_model.p, and Cities_Final_Results.csv. It also adds a row to feature_importances_by_iteration.csv containing the information from the iteration performed in the City Name Correction task.

3) Run Incumbent_voting_Replication.py (approximately 2 minutes), which performs the Incumbent Voting Application. This script requires incumbent.csv and iter0_model.p. Since this task does not include a human-in-the-loop iteration, the script does not save any files. It simply reproduces the steps that generate the relevant string pairs, and we include the hand-labeled output as Survey_Final_Results.csv.

4) Run the four R scripts to produce final figures and tables. Each R script takes less than a minute to run.

The suggested order of running files is:
 Amicus_Bonica_Replication.py
 Cities_Replication.py
 Incumbent_voting_Replication.py
 amicus_analysis.R
 cities_analysis.R
 incumbent_analysis.R
 table1.R
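
Because each Python script depends on inputs listed in the steps above, and step 1 is very long, it can be worth checking that the inputs exist before launching a script. The following is a minimal sketch under stated assumptions: the names REQUIRED_INPUTS, COMMON_INPUTS, and missing_inputs are hypothetical, the file lists are transcribed from this README rather than from the scripts themselves, and functions.py is treated as needed by all three scripts.

```python
from pathlib import Path

# Assumed to be needed by every Python script (the README asks for it in the same directory).
COMMON_INPUTS = ["functions.py"]

# Inputs per script, transcribed from the Replication Steps in this README.
REQUIRED_INPUTS = {
    "Amicus_Bonica_Replication.py": [
        "hand_labeled_matches.csv",
        "unrepresented.p",
        "amicus_org_names.csv",
        "bonica_org_names.csv",
    ] + [f"iter{i}_pairs.csv" for i in range(1, 11)],
    "Cities_Replication.py": [
        "150k_plus_cities.csv",
        "uscities.csv",
        "cities_hitl_pairs.csv",
        "full_train_set.csv",
        "iter0_model.p",
    ],
    "Incumbent_voting_Replication.py": [
        "incumbent.csv",
        "iter0_model.p",
    ],
}


def missing_inputs(script, directory="."):
    """Return the input files for `script` that are not present in `directory`."""
    base = Path(directory)
    needed = COMMON_INPUTS + REQUIRED_INPUTS[script]
    return [name for name in needed if not (base / name).is_file()]
```

For example, calling missing_inputs("Cities_Replication.py") in a directory that lacks the step 1 outputs would be expected to report full_train_set.csv and iter0_model.p as missing.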


#######################
## File List ##########
#######################


File				Type			Description
150k_plus_cities.csv		Raw Data		PPP loan data, including city names and corresponding states
amicus_analysis.R 			R code: plots	Produces Figure 1, Appendix Figure B2, and Appendix Table B5 
Amicus_Bonica_Replication.py	Python code: Analysis	Performs Interest Group Ideology task
amicus_org_names.csv		Raw Data		Table with names of 13,939 Amicus organizations
AUC_table.csv			Output Data		Table of AUC values per iteration
bonica_org_names.csv		Raw Data		Table with names of 1,332,470 Bonica organizations
cities_all_pairs.csv		Processed Data		All city string pairs in our sample with corresponding distance metrics
cities_hitl_pairs.csv		Processed Data		City string pairs labeled during the human-in-the-loop iteration
cities_analysis.R 			R code: plots	Produces Appendix Tables B1 and B2
Cities_Final_Results.csv	Output Data		Results from the City Name Correction task
cities_model.p			Pickle File		The random forest model trained during the Human in the Loop step in the City Name Correction task
Cities_Replication.py		Python Code: Analysis	Performs the City Name Correction task
evaluation_set.csv		Output Data		The hand-labeled test set used in the Interest Group Ideology task
feature_importances_by_iteration.csv	Output Data	A table tracking feature importances, which task that iteration applies to, and the number of positive and negative instances by iteration
fig1.pdf 				Output Figure 	Main Document Figure 1
figb1.pdf 				Output Figure 	Appendix Figure B1
figb2.png 				Output Figure 	Appendix Figure B2
final_results.csv		Output Data		The final results for the Interest Group Ideology task
full_train_set.csv		Processed Data		The processed training set of 71,043 labeled string pairs
functions.py			Python Code		A python script with helper functions
hand_labeled_matches.csv	Raw Data		A table of 357 hand-labeled matches, used to generate full_train_set.csv
incumbent_analysis.R 		R code: plots	Produces Appendix Table B3
Incumbent_voting_Replication.py	Python Code		Performs the Incumbent Voting task
incumbent.csv			Raw Data		Data table used to form the string pairs for the Incumbent Voting matching task
iter0_model.p			Pickle File		Random forest model trained on the training set of 71,043 string pairs
iter1_model.p			Pickle File		Random forest model from the first human in the loop iteration of the Interest Group Ideology task
iter2_model.p			Pickle File		Random forest model from the second human in the loop iteration of the Interest Group Ideology task
iter3_model.p			Pickle File		Random forest model from the third human in the loop iteration of the Interest Group Ideology task
iter4_model.p			Pickle File		Random forest model from the fourth human in the loop iteration of the Interest Group Ideology task
iter5_model.p			Pickle File		Random forest model from the fifth human in the loop iteration of the Interest Group Ideology task
iter6_model.p			Pickle File		Random forest model from the sixth human in the loop iteration of the Interest Group Ideology task
iter7_model.p			Pickle File		Random forest model from the seventh human in the loop iteration of the Interest Group Ideology task
iter8_model.p			Pickle File		Random forest model from the eighth human in the loop iteration of the Interest Group Ideology task
iter9_model.p			Pickle File		Random forest model from the ninth human in the loop iteration of the Interest Group Ideology task
iter10_model.p			Pickle File		Random forest model from the tenth human in the loop iteration of the Interest Group Ideology task
iter1_pairs.csv			Processed Data		Hand-labeled string pairs from the first HITL iteration of the Interest Group Ideology task
iter2_pairs.csv			Processed Data		Hand-labeled string pairs from the second HITL iteration of the Interest Group Ideology task
iter3_pairs.csv			Processed Data		Hand-labeled string pairs from the third HITL iteration of the Interest Group Ideology task
iter4_pairs.csv			Processed Data		Hand-labeled string pairs from the fourth HITL iteration of the Interest Group Ideology task
iter5_pairs.csv			Processed Data		Hand-labeled string pairs from the fifth HITL iteration of the Interest Group Ideology task
iter6_pairs.csv			Processed Data		Hand-labeled string pairs from the sixth HITL iteration of the Interest Group Ideology task
iter7_pairs.csv			Processed Data		Hand-labeled string pairs from the seventh HITL iteration of the Interest Group Ideology task
iter8_pairs.csv			Processed Data		Hand-labeled string pairs from the eighth HITL iteration of the Interest Group Ideology task
iter9_pairs.csv			Processed Data		Hand-labeled string pairs from the ninth HITL iteration of the Interest Group Ideology task
iter10_pairs.csv		Processed Data		Hand-labeled string pairs from the tenth HITL iteration of the Interest Group Ideology task
mturk_eval.csv 			Processed Data 		Raw data from Mechanical Turk used to generate training data
README.txt 			README 			This is the file you are currently reading
Survey_Final_Results.csv	Output Data		Results from the Incumbent Voting task
table1.R 					R Code 			Produces Table 1
table1.tex 					Output table 	Main Document Table 1
tableb1.tex 					Output table 	Appendix Table B1
tableb2.tex 					Output table 	Appendix Table B2
tableb3.tex 					Output table 	Appendix Table B3
tableb5.tex 					Output table 	Appendix Table B5
unrepresented.p			Pickle File		A list of 11 Amicus organization names used to generate full_train_set.csv
uscities.csv			Raw Data		Census data used to generate a list of correctly spelled city names

#######################
## Environment ########
#######################

MacOS                              10.13.4
python                             3.6.9
pandas                             0.25.2
scikit-learn                       0.21.3
textdistance                       4.1.5
more-itertools                     7.2.0
numpy                              1.16.3
pickleshare                        0.7.5
matplotlib                         3.1.1

R                                  4.0.2
pROC                               1.16.2
RecordLinkage                      0.4.12.1
xtable                             1.8.4
stringdist                         0.9.6
tidyverse                          1.3.0